Adaptive Data Replication Scheme Based on Access Count Prediction in Hadoop
Abstract
Hadoop, an open-source implementation of the MapReduce framework, has been widely used for processing massive-scale data in parallel. Since Hadoop uses a distributed file system called HDFS, the data locality problem often arises (i.e., a data block must be copied to a processing node that does not hold the block in its local storage), which degrades performance. In this paper, we present an Adaptive Data Replication scheme based on Access count Prediction (ADRAP) for the Hadoop framework to address the data locality problem. The proposed scheme predicts the next access count of a data file by applying Lagrange's interpolation to its previous access counts. Based on the predicted access count, the scheme selectively decides whether to generate a new replica or to use the loaded data as a cache, optimizing the replication factor. Furthermore, we provide a replica placement algorithm that improves data locality effectively. Performance evaluations show that our adaptive data replication scheme reduces task completion time in the map phase by 9.6% on average, compared to the default data replication setting in Hadoop. With regard to data locality, our scheme increases node locality by 6.1% and decreases rack and rack-off locality by 45.6% and 56.5%, respectively.

Keywords: Hadoop, Data locality, Access prediction, Data replication, Data placement
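The abstract names Lagrange's interpolation over previous access counts as the prediction mechanism but gives no implementation detail, so the following Python sketch is illustrative only; the per-epoch indexing, the threshold rule in decide(), and all identifiers are assumptions rather than the authors' code.

from typing import Sequence

def predict_next_access_count(counts: Sequence[float]) -> float:
    """Predict the next access count with Lagrange interpolation.

    counts[i] is the observed access count in epoch i (x = 0..n-1);
    the prediction is the interpolating polynomial evaluated at x = n,
    i.e., the upcoming epoch. The epoch indexing is an assumption.
    """
    n = len(counts)
    prediction = 0.0
    for i, c_i in enumerate(counts):
        # Lagrange basis polynomial L_i(x) evaluated at x = n.
        basis = 1.0
        for j in range(n):
            if j != i:
                basis *= (n - j) / (i - j)
        prediction += c_i * basis
    return max(prediction, 0.0)  # an access count cannot be negative

def decide(counts: Sequence[float], replica_threshold: float) -> str:
    """Hypothetical decision rule: replicate hot files, cache the rest.

    The abstract only says the choice is made "with the predicted data
    access count"; this threshold comparison is an assumed stand-in.
    """
    if predict_next_access_count(counts) >= replica_threshold:
        return "create-replica"
    return "use-as-cache"

# Example: a file accessed 2, 5, 11 times in the last three epochs.
print(decide([2, 5, 11], replica_threshold=10.0))  # -> "create-replica"

For the example history [2, 5, 11], the interpolating polynomial extrapolates to 20 accesses in the next epoch, so an assumed threshold of 10 would trigger creation of a new replica.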
Similar Articles
Efficient Data Replication Scheme based on Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is designed to store huge data sets reliably and has been widely used for processing massive-scale data in parallel. In HDFS, the data locality problem is one of the critical problems that degrade file system performance. To solve the data locality problem, we propose an efficient data replication scheme based on access count prediction in a Hadoop...
An Experimental Evaluation of Performance of A Hadoop Cluster on Replica Management
Hadoop is an open-source implementation of the MapReduce framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle massive-scale data, Hadoop exploits the Hadoop Distributed File System, termed HDFS. HDFS, similar to most distributed file systems, sh...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop implementation and the existing rack-aware data placement strategy of the Hadoop distributed file system assume a homogeneous cluster, in which every node has the same computing capacity and is assigned the same workload. Default Hadoop d...
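This abstract is truncated before it reaches the placement algorithm itself, so the sketch below only illustrates the general idea it motivates: assigning blocks in proportion to each node's computing capacity. The function name, the capacity map, and the proportional rule are assumed for illustration and are not the paper's algorithm.

def place_blocks(num_blocks: int, capacities: dict[str, float]) -> dict[str, int]:
    """Assign data blocks to nodes in proportion to computing capacity.

    capacities maps node name -> relative capacity (e.g., a measured
    processing rate); a node twice as fast receives roughly twice as
    many blocks, so fast nodes are not starved of local data.
    """
    total = sum(capacities.values())
    assignment = {node: int(num_blocks * cap / total)
                  for node, cap in capacities.items()}
    # Hand out the blocks lost to rounding, fastest nodes first.
    leftover = num_blocks - sum(assignment.values())
    for node in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
        assignment[node] += 1
    return assignment

# Example: 10 blocks over one fast node and two slower ones.
print(place_blocks(10, {"node-a": 2.0, "node-b": 1.0, "node-c": 1.0}))
# -> {'node-a': 6, 'node-b': 2, 'node-c': 2}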
Dynamic Replication based on Firefly Algorithm in Data Grid
In a data grid, reservation is used to provide scheduling and quality of service. Users need access to data stored in a geographically distributed environment; replication addresses this need by directing each user toward the nearest replica of the requested data. The most important point is to know at which sites and in which distributed sy...
Delay Scheduling Based Replication Scheme for Hadoop Distributed File System
The data generated and processed by modern computing systems are burgeoning rapidly. MapReduce is an important programming model for large-scale data-intensive applications. Hadoop is a popular open-source implementation of MapReduce and the Google File System (GFS). The scalability and fault-tolerance features of Hadoop make it a standard for Big Data processing. Hadoop uses the Hadoop Distributed File Sys...